This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available non-annotated web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder as well as feature extractor that performs consistently well across tasks including FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), and LS (29.36% gain in Frechet Inception Distance), even in the low-data regime. Our code and pre-trained models will be made public.
We present a unified formulation and model for three motion and 3D perception tasks: optical flow, rectified stereo matching and unrectified stereo depth estimation from posed images. Unlike previous specialized architectures for each specific task, we formulate all three tasks as a unified dense correspondence matching problem, which can be solved with a single model by directly comparing feature similarities. Such a formulation calls for discriminative feature representations, which we achieve using a Transformer, in particular the cross-attention mechanism. We demonstrate that cross-attention enables integration of knowledge from another image via cross-view interactions, which greatly improves the quality of the extracted features. Our unified model naturally enables cross-task transfer since the model architecture and parameters are shared across tasks. We outperform RAFT with our unified model on the challenging Sintel dataset, and our final model that uses a few additional task-specific refinement steps outperforms or compares favorably to recent state-of-the-art methods on 10 popular flow, stereo and depth datasets, while being simpler and more efficient in terms of model design and inference speed.
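Human pose forecasting is a challenging problem involving complex human body motion and posture dynamics. When there are multiple people in the environment, one person's motion may also be influenced by the motion and dynamic movements of others. Although several previous works target multi-person dynamic pose forecasting, they often model the entire pose sequence as a time series (ignoring the underlying relationships between joints) or output the future pose sequence of only one person at a time. In this paper, we present a new method, called Social Motion Transformer (SoMoFormer), for multi-person 3D pose forecasting. Our transformer architecture uniquely models human motion input as a joint sequence rather than a time sequence, allowing us to perform attention over joints while predicting an entire future motion sequence for each joint in parallel. We show that, with this problem reformulation, SoMoFormer naturally extends to multi-person scenes by using the joints of all people in a scene as input queries. Using learned embeddings to denote the type of joint, person identity, and global position, our model learns the relationships between joints and between people, attending more strongly to joints from the same or nearby people. SoMoFormer outperforms state-of-the-art methods for long-term motion prediction on the SoMoF benchmark as well as the CMU-Mocap and MuPoTS-3D datasets. Code will be made available after publication.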
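We present PiFeNet, an efficient and accurate real-time 3D detector for pedestrian detection from point clouds. We address two challenges that 3D object detection frameworks encounter when detecting pedestrians: the low expressiveness of pillar features and the small occupation areas of pedestrians in point clouds. First, we introduce a stackable Pillar Aware Attention (PAA) module for enhanced pillar feature extraction while suppressing noise in the point clouds. By integrating multi-point-aware pooling and point-wise, channel-wise, and task-aware attention into a simple module, the representation capabilities are boosted while requiring little additional computing resources. We also present Mini-BiFPN, a small yet effective feature network that creates bidirectional information flow and multi-level cross-scale feature fusion to better integrate multi-resolution features. Our approach ranks first on the KITTI pedestrian BEV and 3D leaderboards while running at 26 frames per second (FPS), and achieves state-of-the-art performance on the nuScenes detection benchmark.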
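Learning-based optical flow estimation has been dominated by the pipeline of a cost volume with convolutions for flow regression, which is inherently limited to local correlations and thus struggles with the long-standing challenge of large displacements. To alleviate this, the state-of-the-art method, RAFT, gradually improves the quality of its predictions by producing a sequence of flow updates via a large number of iterative refinements, achieving remarkable performance but slowing down inference. To enable both highly accurate and efficient optical flow estimation, we completely revamp the dominant flow regression pipeline by reformulating optical flow as a global matching problem. Specifically, we propose the GMFlow framework, which consists of three main components: a customized Transformer for feature enhancement, a correlation and softmax layer for global feature matching, and a self-attention layer for flow propagation. Moreover, we introduce a refinement step that reuses GMFlow at higher resolutions for residual flow prediction. Our new framework outperforms 32-iteration RAFT on the challenging Sintel benchmark while using only one refinement and running faster, offering new possibilities for efficient and accurate optical flow estimation. Code will be available at https://github.com/haofeixu/gmflow.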
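The "correlation and softmax" idea behind global matching can be illustrated with a toy one-dimensional sketch. This is not the GMFlow implementation: the function names, the 1D setting, the scaling of the correlation by the square root of the feature dimension, and the hand-made one-hot features are all assumptions chosen for illustration. Each position in the first signal is compared against every position in the second, the similarities are turned into a matching distribution with a softmax, and the flow is the expected matched position minus the original position.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_match_1d(feats1, feats2):
    """Toy global matching: returns one flow value per position of feats1.

    feats1, feats2: lists of equal-length feature vectors, one per position.
    (Hypothetical helper, not part of any released code.)
    """
    dim = len(feats1[0])
    flow = []
    for i, f1 in enumerate(feats1):
        # Correlate this feature with *every* position in the second signal.
        corr = [sum(a * b for a, b in zip(f1, f2)) / math.sqrt(dim) for f2 in feats2]
        # Softmax turns similarities into a matching distribution.
        weights = softmax(corr)
        # Expected matched coordinate, then displacement relative to i.
        matched = sum(w * j for j, w in enumerate(weights))
        flow.append(matched - i)
    return flow


# Distinctive (scaled one-hot) features, with the second signal shifted by 2.
n = 6
feats1 = [[10.0 if k == i else 0.0 for k in range(n)] for i in range(n)]
feats2 = [feats1[(j - 2) % n] for j in range(n)]
flow = global_match_1d(feats1, feats2)
print([round(f, 3) for f in flow])
```

Because every position is compared with every other, a large displacement costs no more than a small one, which is the property the abstract contrasts with local cost volumes. The last two positions wrap around in this toy setup, so their recovered displacement is negative.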
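The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognize human actions and their social interactions in an unconstrained real-world environment comprising numerous people, with potentially highly unbalanced and long-tailed action label distributions, from a stream of sensory data captured by a mobile robot platform remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. In this paper, we introduce JRDB-Act, an extension of the existing JRDB, which is captured by a social mobile manipulator and reflects a real distribution of human daily-life actions in a university campus environment. JRDB-Act is densely annotated with atomic actions and comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labeled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover, JRDB-Act provides social group annotations, conducive to the task of grouping individuals based on their interactions in the scene to infer their social activities (the common activities in each social group). Each annotated label in JRDB-Act is tagged with the annotators' confidence level, which contributes to the development of reliable evaluation strategies. To demonstrate how such annotations can be used effectively, we develop an end-to-end trainable pipeline to learn and infer these tasks, i.e., individual action and social group detection. The data and the evaluation code are publicly available at https://jrdb.erc.monash.edu/.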
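This paper investigates performance evaluation criteria for basic vision tasks involving sets of objects, namely object detection, instance-level segmentation, and multi-object tracking. The rankings of algorithms under an existing criterion can fluctuate with different choices of parameters, e.g., the Intersection over Union (IoU) threshold, making their evaluations unreliable. More importantly, there is no means to verify whether we can trust the evaluations of a criterion. This work proposes a notion of trustworthiness for performance criteria, which requires (i) robustness to parameters for reliability, (ii) contextual meaningfulness in sanity tests, and (iii) consistency with mathematical requirements such as the metric properties. We observe that these requirements are overlooked by many widely used criteria, and we explore alternative criteria using metrics for sets of shapes. We also assess all these criteria against the proposed requirements for trustworthiness.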
Intersection over Union (IoU) is the most popular evaluation metric used in object detection benchmarks. However, there is a gap between optimizing the commonly used distance losses for regressing the parameters of a bounding box and maximizing this metric value. The optimal objective for a metric is the metric itself. In the case of axis-aligned 2D bounding boxes, it can be shown that IoU can be directly used as a regression loss. However, IoU has a plateau, making it infeasible to optimize in the case of non-overlapping bounding boxes. In this paper, we address the weaknesses of IoU by introducing a generalized version as both a new loss and a new metric. By incorporating this generalized IoU (GIoU) as a loss into state-of-the-art object detection frameworks, we show a consistent improvement in their performance using both the standard IoU-based and the new GIoU-based performance measures on popular object detection benchmarks such as PASCAL VOC and MS COCO.
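The plateau described above, and how a generalized IoU removes it, can be seen in a minimal sketch for axis-aligned boxes. The abstract does not spell out the formula; the code assumes the commonly stated definition, GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest box enclosing both A and B. The function names and the `(x1, y1, x2, y2)` box layout are illustrative choices.

```python
def iou_and_giou(box_a, box_b):
    """Return (IoU, GIoU) for two axis-aligned boxes given as (x1, y1, x2, y2).

    Assumes x1 < x2 and y1 < y2, so both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection is empty when the boxes do not overlap on either axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest axis-aligned box enclosing both inputs.
    enclose_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = iou - (enclose_area - union) / enclose_area
    return iou, giou


def giou_loss(box_a, box_b):
    """Loss form assumed here: 1 - GIoU, which lies in [0, 2]."""
    return 1.0 - iou_and_giou(box_a, box_b)[1]


near = iou_and_giou((0, 0, 1, 1), (2, 0, 3, 1))   # disjoint, close together
far = iou_and_giou((0, 0, 1, 1), (9, 0, 10, 1))   # disjoint, far apart
print(near, far)
```

For both disjoint pairs the IoU is 0, so an IoU-based loss gives no signal about how far apart the boxes are, whereas the GIoU value keeps decreasing as the boxes move apart.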
Research on automated essay scoring has become increasingly important because it serves as a method for evaluating students' written responses at scale. Scalable methods for scoring written responses are needed as students migrate to online learning environments, resulting in the need to evaluate large numbers of written-response assessments. The purpose of this study is to describe and evaluate three active learning methods that can be used to minimize the number of essays that must be scored by human raters while still providing the data needed to train a modern automated essay scoring system. The three active learning methods are the uncertainty-based, the topological-based, and the hybrid method. These three methods were used to select essays from the Automated Student Assessment Prize competition, which were then classified using a scoring model trained with the Bidirectional Encoder Representations from Transformers (BERT) language model. All three active learning methods produced strong results, with the topological-based method producing the most efficient classification. Growth rate accuracy was also evaluated. The active learning methods produced different levels of efficiency under different sample size allocations but, overall, all three methods were highly efficient and produced classifications that were similar to one another.
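As a rough illustration of what an uncertainty-based selection step can look like, the sketch below ranks unscored essays by the entropy of a model's predicted score distribution and sends the most uncertain ones to human raters. The abstract does not say which uncertainty measure the study used, so entropy is an assumption here, and the function names, essay identifiers, and probabilities are made up for the example.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a predicted score distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(predictions, k):
    """Pick the k essays whose predicted score distribution is most uncertain.

    predictions: dict mapping essay id -> list of class probabilities.
    (Hypothetical helper; the uncertainty measure is an assumption.)
    """
    ranked = sorted(predictions, key=lambda essay_id: entropy(predictions[essay_id]), reverse=True)
    return ranked[:k]


# Made-up model outputs for three unscored essays over three score levels.
predictions = {
    "essay_1": [0.90, 0.05, 0.05],  # model is confident
    "essay_2": [0.34, 0.33, 0.33],  # model is close to guessing
    "essay_3": [0.60, 0.30, 0.10],
}
to_label = select_most_uncertain(predictions, k=2)
print(to_label)
```

In an active learning loop, the selected essays would be scored by human raters, added to the training set, and the model retrained before the next round of selection.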
Osteoarthritis (OA) is the most prevalent chronic joint disease worldwide, with the knee accounting for more than 80% of commonly affected joints. Knee OA is not yet curable, and it affects a large number of patients, making it costly to patients and healthcare systems. The etiology, diagnosis, and treatment of knee OA may be complicated by variability in its clinical and physical manifestations. Although knee OA carries a list of well-known terms intended to standardize the nomenclature of the diagnosis, prognosis, treatment, and clinical outcomes of this chronic joint disease, in practice a wide range of terminology is associated with knee OA across different data sources, including but not limited to biomedical literature, clinical notes, healthcare literacy, and health-related social media. Among these data sources, scientific articles published in the biomedical literature usually follow a principled pipeline for studying the disease. Rapid yet accurate text mining of large-scale scientific literature may discover novel knowledge and terminology to better understand knee OA and to improve the quality of knee OA diagnosis, prevention, and treatment. The present work aims to use artificial neural network strategies to automatically extract vocabulary associated with knee OA. Our findings indicate the feasibility of developing word-embedding neural networks for autonomous keyword extraction and abstraction of knee OA.